Summary

This report presents a renewed analysis of start codon usage and context in Cryptococcus neoformans. It uses the best-transcript annotation and the corresponding map of start codon positions and sequences made by Corinne Maufrais in June 2018.

It covers both JEC21 and H99 data: first several analyses on JEC21, then the same analyses on H99, then a joint analysis of signals conserved across both strains.

We check consensus sequences for both “narrow” (-4 NNNNATG) and “wide” (-10 NNNNNNNNNNATG) neighbourhoods of the start codon, comparing annotated ATGs (aATGs) to downstream ATGs (dATGs), and find essentially the same results with both. The following analyses mostly use the wide score.
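A minimal sketch of how the two neighbourhoods could be cut out of a transcript, written in Python purely for illustration (the analysis itself was run in R). The function name `start_context`, the 0-based coordinate convention and the example sequence are all assumptions of this sketch, not taken from the annotation files.

```python
def start_context(transcript, atg_pos, upstream):
    """Return `upstream` bases before the ATG plus the ATG itself.

    atg_pos is the 0-based index of the A of the ATG (an assumption of
    this sketch). Returns None if there are fewer than `upstream` bases
    before the ATG, or if the codon at atg_pos is not ATG.
    """
    if atg_pos < upstream or transcript[atg_pos:atg_pos + 3] != "ATG":
        return None
    return transcript[atg_pos - upstream:atg_pos + 3]


# Hypothetical transcript, not real data.
tx = "GGCACCGTGCACACAAAATGGCTTCA"
pos = tx.find("ATG")                 # first ATG in the toy transcript
narrow = start_context(tx, pos, 4)   # -4 NNNNATG
wide = start_context(tx, pos, 10)    # -10 NNNNNNNNNNATG
print(narrow, wide)
```

Transcripts with a 5' UTR shorter than the window return None here; how such genes were handled in the real analysis is not stated.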

Notes

Generalized additive model smooths use a thin plate regression spline with basis dimension k=4, from the mgcv package (Wood S.N. (2017) Generalized Additive Models).

Load Packages

JEC21 first

Expression: RNA abundance and ribosome-protected-fragments

Load expression data

## # A tibble: 6,634 x 4
## # Groups:   Gene [6,634]
##    Gene        RNA    RPF    TE
##    <chr>     <dbl>  <dbl> <dbl>
##  1 CNM01300  3981. 18260. 4.59 
##  2 CNM01080  8422.  8957. 1.06 
##  3 CNA07570  5811.  7134. 1.23 
##  4 CNG04360  3171.  7048. 2.22 
##  5 CNB02360  3764.  6958. 1.85 
##  6 CNA06350 15019.  6591. 0.439
##  7 CNC00700  2344.  6224. 2.65 
##  8 CNF03840 11321.  6175. 0.545
##  9 CNF02150 15611.  6144. 0.394
## 10 CNF03160  5094.  6114. 1.20 
## # ... with 6,624 more rows

We also calculated hiTrans_JEC21, the top 5% (330) translated genes by RPF TPM.
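The TE column in the table above is consistent with TE = RPF / RNA. A short Python sketch of that ratio and of picking the most highly translated genes by RPF follows; it is illustrative only. The three rows are copied from the printed table, while the helper names `add_te` and `top_n_by` are hypothetical.

```python
def add_te(rows):
    """Add TE = RPF / RNA to each row (rows are dicts with Gene, RNA, RPF)."""
    return [dict(r, TE=r["RPF"] / r["RNA"]) for r in rows]


def top_n_by(rows, key, n):
    """Return the n rows with the largest value of `key`."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:n]


# Three rows from the JEC21 table above (values as printed, so rounded).
rows = [
    {"Gene": "CNA06350", "RNA": 15019.0, "RPF": 6591.0},
    {"Gene": "CNM01300", "RNA": 3981.0, "RPF": 18260.0},
    {"Gene": "CNM01080", "RNA": 8422.0, "RPF": 8957.0},
]
with_te = add_te(rows)
top2 = top_n_by(with_te, "RPF", 2)
for r in top2:
    print(r["Gene"], round(r["TE"], 2))
```

For the real hiTrans_JEC21 set, n would be 330; the rule used to arrive at exactly 330 genes is not shown in this report.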

Ribosome occupancy mostly tracks RNA abundance

ATG Context

Load context data

## # A tibble: 6,636 x 19
##    Gene  aATG.context aATG.pos d1.context d1.posTSS d1.posATG d1.frame
##    <chr> <chr>           <dbl> <chr>          <dbl>     <dbl>    <dbl>
##  1 CNA0… GACCCCCTTGT…       93 ATAGCTGGT…       226       133        1
##  2 CNA0… ATATTGCCTGA…      102 GTCCACCTT…       163        61        1
##  3 CNA0… GAACTATCAAG…      214 GAGGCTCCG…       512       298        1
##  4 CNA0… ATTTTCAACAG…       81 AGCAATATA…       307       226        1
##  5 CNA0… ACCGTGCACAC…       76 GTATTCGGG…       106        30        0
##  6 CNA0… AATCATACCAA…      117 GCCCCTATC…       186        69        0
##  7 CNA0… CCGACTATAAA…       52 AACCGTGCT…       112        60        0
##  8 CNA0… CTTTCTCTTCA…       77 TGCTATAGC…        98        21        0
##  9 CNA0… TAATCACACAA…      330 CTCATCATC…       391        61        1
## 10 CNA0… AAAAAAAACGC…      146 ACTTGTCGA…       184        38        2
## # ... with 6,626 more rows, and 12 more variables: d2.context <chr>,
## #   d2.posTSS <dbl>, d2.posATG <dbl>, d2.frame <dbl>, u1.context <chr>,
## #   u1.posTSS <dbl>, u1.posATG <dbl>, u1.frame <dbl>, u2.context <chr>,
## #   u2.posTSS <dbl>, u2.posATG <dbl>, u2.frame <dbl>
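In the rows displayed above, d1.posATG equals d1.posTSS minus aATG.pos, and d1.frame equals d1.posATG modulo 3. This relation is inferred from the printed rows, not from the hidden code. A small Python sketch of it, with a hypothetical function name:

```python
def position_and_frame(a_pos_tss, x_pos_tss):
    """Position of another ATG relative to the annotated ATG, and its frame.

    Both inputs are positions measured from the TSS. Frame 0 means the
    other ATG is in frame with the annotated ATG.
    """
    pos_atg = x_pos_tss - a_pos_tss
    return pos_atg, pos_atg % 3


# (aATG.pos, d1.posTSS) pairs from rows 1, 2, 5 and 10 of the table above.
pairs = [(93, 226), (102, 163), (76, 106), (146, 184)]
results = [position_and_frame(a, d) for a, d in pairs]
print(results)
```

For upstream ATGs the difference is negative; Python's `%` then returns a non-negative remainder, which may or may not match the convention used for u1.frame and u2.frame.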

Annotated ATGs have a Kozak consensus sequence

Highly translated Annotated ATGs have a Kozak consensus sequence

That’s for hiTrans_JEC21, the top 5% (330) translated genes by RPF TPM.

Cytoplasmic Ribosome Annotated ATGs have a Kozak consensus sequence

All cytoribo genes are highly translated

Venn diagram

Upstream ATGs don’t have a consensus

First upstream ATG.

Downstream ATGs don’t have a consensus

First downstream ATG

Downstream ATGs in frame and highly translated don’t have a consensus

Except for 3rd-codon-position bias.

Calculate Information content and scores of consensus motif

Calculate a wide and a narrow consensus sequence

Calculate motif scores against the position weight matrix (PWM) for both the narrow (-4 from ATG through to ATG) and wide (-10 from ATG through to ATG) Kozak consensus motifs. These motifs are taken from the top 5% most highly translated genes.

Estimate the information content

The information content is estimated as for a sequence logo; details at https://en.wikipedia.org/wiki/Sequence_logo.

This is equal to the total height of the letters in the sequence logo, summed across the positions of the motif.

## # A tibble: 6 x 4
##   Genes    ATG   Width  Infon
##   <chr>    <chr> <chr>  <dbl>
## 1 All      aATG  narrow 1.03 
## 2 HiTrans  aATG  narrow 2.84 
## 3 CytoRibo aATG  narrow 4.01 
## 4 All      d1ATG narrow 0.131
## 5 HiTrans  d1ATG narrow 0.307
## 6 CytoRibo d1ATG narrow 0.517

Information content of the highly translated consensus, in bits and excluding the 6 bits from the ATG itself: narrow 2.84, wide 3.81.
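A minimal Python sketch of the sequence-logo information content, for illustration. Each position contributes 2 bits minus its Shannon entropy, so a fully conserved ATG contributes exactly 6 bits, which is why those 6 bits are excluded above. The toy sequences are made up, and the sketch applies no small-sample correction; whether the real calculation applied one is not stated.

```python
import math
from collections import Counter


def information_per_position(seqs):
    """Bits per aligned position: 2 - H, where H is the Shannon entropy."""
    bits = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        bits.append(2.0 - entropy)
    return bits


# Toy narrow contexts (NNNNATG), not real data.
seqs = ["CAAAATG", "CACAATG", "AAAAATG", "CAAGATG"]
per_pos = information_per_position(seqs)
atg_bits = sum(per_pos[-3:])       # the ATG itself
context_bits = sum(per_pos[:-3])   # the four upstream positions
print(round(atg_bits, 3), round(context_bits, 3))
```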

Estimate information content per base across start context

This is equal to the total height of the letters in the sequence logo at each position. It could be useful in comparing the total information without getting overly distracted by the actual letters.

Calculate scores of aATG, dATG, uATG against Kozak consensus

We calculate scores using Biostrings::PWMscoreStartingAt.

The best description I could find of this method is: https://support.bioconductor.org/p/61520/

The score is simply the sum, over positions, of the PWM entry for the base observed at that position in the sequence.
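The scoring rule can be sketched in a few lines of Python. This mirrors the description only and is not the Biostrings implementation. The toy PWM values are made up and chosen so that the best possible sequence scores 1; that normalization is an assumption of the sketch.

```python
def pwm_score(pwm, seq):
    """Sum, over positions, of the PWM entry for the base found there."""
    return sum(pwm[base][i] for i, base in enumerate(seq))


# Toy 4-position PWM: one list per base, one entry per position.
# Values are hypothetical, not estimated from the Cryptococcus data.
pwm = {
    "A": [0.10, 0.30, 0.25, 0.20],
    "C": [0.25, 0.05, 0.05, 0.00],
    "G": [0.00, 0.00, 0.00, 0.05],
    "T": [0.05, 0.00, 0.00, 0.00],
}
best = pwm_score(pwm, "CAAA")    # highest-scoring base at every position
other = pwm_score(pwm, "AAAA")
worst = pwm_score(pwm, "TGGG")
print(round(best, 2), round(other, 2), round(worst, 2))
```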

Write scores to file scores_kozak_JEC21.txt.

## # A tibble: 6,636 x 13
##    Gene  aATG.scorekn d1.scorekn d2.scorekn u1.scorekn aATG.scorekw
##    <chr>        <dbl>      <dbl>      <dbl>      <dbl>        <dbl>
##  1 CNA0…        0.725      0.756      0.899      0.966        0.744
##  2 CNA0…        0.805      0.932      0.727      0.862        0.786
##  3 CNA0…        0.872      0.683      0.823      0.805        0.885
##  4 CNA0…        0.895      0.795      0.837     NA            0.858
##  5 CNA0…        0.964      0.741      0.839     NA            0.866
##  6 CNA0…        1.         0.884      0.698     NA            0.943
##  7 CNA0…        0.977      0.803      0.763     NA            0.919
##  8 CNA0…        0.781      0.964      0.758      0.733        0.769
##  9 CNA0…        0.940      0.920      0.803     NA            0.929
## 10 CNA0…        0.851      0.816      0.781     NA            0.862
## # ... with 6,626 more rows, and 7 more variables: d1.scorekw <dbl>,
## #   d2.scorekw <dbl>, u1.scorekw <dbl>, d1vsan <dbl>, u1vsan <dbl>,
## #   d1vsaw <dbl>, u1vsaw <dbl>

Plot against narrow consensus (-4 to ATG)

Narrow consensus comparing d1ATG to d2ATG score

Narrow consensus d1ATG score by frame

Same again, segregated by expression (enoughRNA, top 50%)

Plot against wide consensus (-10 to ATG)

Wide consensus comparing d1ATG to d2ATG score

Mutual information of different positions around ATG

For all annotated genes

For highly translated genes

For cytoplasmic ribosomal proteins

This illustrates that the mutual information (MI) between pairs of nucleotides is generally weak, except for nucleotides sharing a codon in the +4 to +12 positions and, secondarily, at the -6 to -4 positions.
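For reference, a minimal Python sketch of the mutual information between two aligned positions, using the plug-in estimate from observed frequencies. The function name and toy sequences are hypothetical. The plug-in estimate is biased upwards for small samples, which matters for small gene sets.

```python
import math
from collections import Counter


def mutual_information(seqs, i, j):
    """Plug-in mutual information (bits) between positions i and j."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) / (p(a) p(b)) simplifies to c * n / (count_a * count_b)
        mi += (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
    return mi


coupled = ["TC", "TC", "AA", "AA"]        # position 1 determined by position 0
independent = ["AA", "AC", "CA", "CC"]    # every combination equally often
mi_coupled = mutual_information(coupled, 0, 1)
mi_independent = mutual_information(independent, 0, 1)
print(mi_coupled, mi_independent)
```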

What is the -5 correlation in hiTrans genes?

What is the -5 correlation in cyto ribo genes?

So there is a strong tendency to have a C at -5 if there is a T at -6.

What about the -4 correlation in hiTrans genes?

What about the -4 correlation in cyto ribo genes?

This is very interesting: essentially everything with a -4A has a -2A and -1A, but a -4C is more relaxed.
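The pairwise questions above (for example, how often -5 is C given that -6 is T) amount to a table of conditional frequencies. A minimal Python sketch on made-up wide contexts follows. In a wide context (10 upstream bases plus ATG), position -k sits at 0-based index 10 - k; that indexing is the only convention assumed here.

```python
from collections import Counter, defaultdict


def conditional_frequencies(seqs, given, target):
    """Frequency of each base at index `target` for each base at `given`."""
    table = defaultdict(Counter)
    for s in seqs:
        table[s[given]][s[target]] += 1
    return {g: {b: c / sum(cnt.values()) for b, c in cnt.items()}
            for g, cnt in table.items()}


def idx(k):
    """0-based index of position -k in a wide (-10 to ATG) context."""
    return 10 - k


# Toy wide contexts, not real data.
seqs = [
    "AAAATCAAAAATG",
    "GGGGTCCAAAATG",
    "CCCCTACAAAATG",
    "CCCCAACAAAATG",
]
freq = conditional_frequencies(seqs, idx(6), idx(5))
print(freq)
```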

What about the -1 correlation in hiTrans genes?

What about the -1 correlation in cyto ribo genes?

This looks interesting and needs a better way of summarizing.

Compare aATG and dATG context by gene

Most dATG scores are less than aATG scores

Most u1ATG scores are less than aATG scores

For highly translated genes, most dATG scores are much less than aATG scores

Red: high dATG vs aATG Kozak score. Blue: highly translated. Purple: both.

Small negative correlation between dATG and aATG score

Rnarrow = -0.07; Rwide = -0.054

There may be some trend but it is weak

Boxplots show enough_RNA only.

Genes with unusual dATG vs aATG score

Genes with an unusually high d1ATG narrow score relative to their aATG narrow score are in this list:

## # A tibble: 330 x 3
##    Gene     aATG.scorekn d1.scorekn
##    <chr>           <dbl>      <dbl>
##  1 CNA01530        0.673      1.   
##  2 CNB00520        0.683      0.989
##  3 CNI00670        0.698      1.   
##  4 CNK00900        0.707      1.   
##  5 CND02465        0.725      1.   
##  6 CNI00690        0.673      0.945
##  7 CNG04505        0.730      1.   
##  8 CNA00760        0.733      1.   
##  9 CNI00340        0.706      0.966
## 10 CNB04570        0.730      0.989
## # ... with 320 more rows

dATG in frame with ATG, narrow

Genes with high difference in narrow score, filtered for reasonable amounts of RNA, in frame. Saved to dvsaATG_highdiffn_inframe_JEC21.txt.

## # A tibble: 87 x 3
##    Gene     aATG.scorekn d1.scorekn
##    <chr>           <dbl>      <dbl>
##  1 CNA01530        0.673      1.   
##  2 CNI00670        0.698      1.   
##  3 CNI00690        0.673      0.945
##  4 CNA00760        0.733      1.   
##  5 CNB04570        0.730      0.989
##  6 CNF02520        0.673      0.925
##  7 CNH00360        0.740      0.989
##  8 CNG01740        0.742      0.989
##  9 CNN00820        0.699      0.945
## 10 CNB01775        0.756      1.   
## # ... with 77 more rows
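A hedged Python sketch of this kind of filter. The 0.1 cut-off is borrowed from the d1vsaw0p1 column name used later in this report, and treating frame 0 as in frame follows the d1.frame column; whether these exact rules produced the saved lists is an assumption. The first two genes and their scores are taken from the tables in this section. The frame value 1 for CNI00340 (which appears in the out-of-frame list) and the two toy rows are hypothetical.

```python
def high_diff(rows, threshold=0.1, in_frame=True):
    """Genes with enough RNA whose d1 score exceeds the aATG score
    by more than `threshold`, restricted to in-frame or out-of-frame."""
    selected = []
    for r in rows:
        if not r["enough_rna"]:
            continue
        if r["d1"] - r["a"] <= threshold:
            continue
        if (r["frame"] == 0) != in_frame:
            continue
        selected.append(r["gene"])
    return selected


rows = [
    {"gene": "CNA01530", "a": 0.673, "d1": 1.0, "frame": 0, "enough_rna": True},
    {"gene": "CNI00340", "a": 0.706, "d1": 0.966, "frame": 1, "enough_rna": True},
    {"gene": "toy_low_diff", "a": 0.90, "d1": 0.85, "frame": 0, "enough_rna": True},
    {"gene": "toy_low_rna", "a": 0.60, "d1": 0.95, "frame": 0, "enough_rna": False},
]
in_frame_genes = high_diff(rows, in_frame=True)
out_frame_genes = high_diff(rows, in_frame=False)
print(in_frame_genes, out_frame_genes)
```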

dATG out of frame with ATG, narrow

Genes with high difference in narrow score, filtered for reasonable amounts of RNA, out of frame. Saved to dvsaATG_highdiffn_outframe_JEC21.txt.

## # A tibble: 52 x 3
##    Gene     aATG.scorekn d1.scorekn
##    <chr>           <dbl>      <dbl>
##  1 CNI00340        0.706      0.966
##  2 CNC07140        0.761      1.   
##  3 CNN00160        0.759      0.989
##  4 CNB05380        0.758      0.977
##  5 CND06330        0.751      0.966
##  6 CND01650        0.707      0.922
##  7 CNG00490        0.792      1.   
##  8 CNK01150        0.761      0.952
##  9 CNC02170        0.811      1.   
## 10 CNE04530        0.811      1.   
## # ... with 42 more rows

dATG in frame with ATG, wide

Genes with high difference in wide score, filtered for reasonable amounts of RNA, in frame. Saved to dvsaATG_highdiffw_inframe_JEC21.txt.

## # A tibble: 86 x 3
##    Gene     aATG.scorekw d1.scorekw
##    <chr>           <dbl>      <dbl>
##  1 CNI00690        0.631      0.913
##  2 CNA04990        0.669      0.939
##  3 CNF02520        0.665      0.925
##  4 CNB01775        0.661      0.912
##  5 CNG01740        0.717      0.963
##  6 CNA05725        0.721      0.967
##  7 CNN00820        0.657      0.900
##  8 CNG01890        0.607      0.845
##  9 CNB04570        0.721      0.954
## 10 CNI00670        0.730      0.961
## # ... with 76 more rows

dATG out of frame with ATG, wide

Genes with high difference in wide score, filtered for reasonable amounts of RNA, out of frame. Saved to dvsaATG_highdiffw_outframe_JEC21.txt.

## # A tibble: 54 x 3
##    Gene     aATG.scorekw d1.scorekw
##    <chr>           <dbl>      <dbl>
##  1 CNI00340        0.665      0.953
##  2 CNC07140        0.664      0.912
##  3 CNE04530        0.757      0.980
##  4 CNN00160        0.721      0.930
##  5 CNN01750        0.759      0.953
##  6 CNG00500        0.736      0.928
##  7 CND01020        0.778      0.965
##  8 CNG02120        0.770      0.945
##  9 CND01650        0.727      0.902
## 10 CNF01580        0.724      0.898
## # ... with 44 more rows

dATG vs aATG ribosome occupancy depends on the context

For top 3315 / 50% of genes by mean RNA TPM.

dATG vs aATG ribosome occupancy depends on the context, geometric mean across reps
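Occupancy ratios are naturally averaged on a log scale, which is what the geometric mean does. A minimal Python sketch follows; the per-replicate ratios are hypothetical, and returning None for non-positive values is a choice made for this sketch, not necessarily how zero counts were handled in the analysis.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values; None otherwise."""
    if not values or any(v <= 0 for v in values):
        return None
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical dATG / aATG occupancy ratios for one gene, one per replicate.
ratios = [0.5, 2.0, 0.25, 1.0]
gm = geometric_mean(ratios)
print(round(gm, 4))
```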

There is slight enrichment in high-score dATGs in frame near the aATG

Compare score difference to localization predictions

Load predictions from mitofates

In input file JEC21_mitofates.txt.

Genes with high dATG vs aATG score are enriched in mitochondrial presequences

There are spectacular correlations between d1 score, d1 frame, and localization

Count dATG vs aATG score, d1 frame, mito pre, enough RNA

## # A tibble: 16 x 5
## # Groups:   enoughR, d1vsaw0p1, d1.framefac [?]
##    enoughR d1vsaw0p1 d1.framefac Pred_preseq     n
##    <fct>   <fct>     <fct>       <fct>       <int>
##  1 Yes     d1lo      In          No            774
##  2 Yes     d1lo      In          Yes           123
##  3 Yes     d1lo      Out         No           2031
##  4 Yes     d1lo      Out         Yes           219
##  5 Yes     d1hi      In          No             52
##  6 Yes     d1hi      In          Yes            37
##  7 Yes     d1hi      Out         No             50
##  8 Yes     d1hi      Out         Yes            10
##  9 No      d1lo      In          No            826
## 10 No      d1lo      In          Yes            62
## 11 No      d1lo      Out         No           2125
## 12 No      d1lo      Out         Yes            89
## 13 No      d1hi      In          No            101
## 14 No      d1hi      In          Yes            13
## 15 No      d1hi      Out         No             96
## 16 No      d1hi      Out         Yes             6
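The enrichment can be read straight off the counts above. For genes with enough RNA and an in-frame d1ATG, 37 of 89 (about 42%) d1hi genes have a predicted presequence, against 123 of 897 (about 14%) d1lo genes. A small Python sketch of the corresponding odds ratios, using only counts from the table, is below. No significance test is attempted, and the function name is hypothetical.

```python
def odds_ratio(hi_yes, hi_no, lo_yes, lo_no):
    """Odds of a predicted presequence in d1hi genes relative to d1lo genes."""
    return (hi_yes * lo_no) / (hi_no * lo_yes)


# Counts from the table above, enoughR == "Yes" only.
in_frame_or = odds_ratio(37, 52, 123, 774)      # d1.framefac == "In"
out_frame_or = odds_ratio(10, 50, 219, 2031)    # d1.framefac == "Out"
frac_hi = 37 / (37 + 52)
frac_lo = 123 / (123 + 774)
print(round(in_frame_or, 2), round(out_frame_or, 2))
```

On these counts the in-frame odds ratio is clearly larger than the out-of-frame one.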

However, mito-localized genes do not have a distinctive aATG context

It’s just a subset: the dual-localized ones.

Negative correlation of dATG and aATG score only for mitochondrial localization signal

Mitofates and d1ATG/d2ATG wide score

uATGs inhibit translation of the main ORF

uATGs are associated with lower absolute translation

uATGs are associated with lower translation efficiency

uATGs associated with lower translation efficiency are over 20nt from TSS

uATG score weakly affects TE

We suspect that uATG is associated with lower TE if the uATG has

  • position at least 20nt downstream from TSS
  • higher score

This figure shows that, for genes with only 1 uATG, this correlation is weak.
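A minimal Python sketch of counting uATGs in a 5' UTR, in total and restricted to those at least 20 nt from the TSS. These two counts may correspond to the uATGCt and uATGCtmin20 columns shown below, but that mapping, the 0-based position convention and the toy UTR are assumptions of this sketch.

```python
def uatg_positions(utr5):
    """0-based positions (from the TSS) of every ATG in the 5' UTR."""
    return [i for i in range(len(utr5) - 2) if utr5[i:i + 3] == "ATG"]


def count_uatgs(utr5, min_from_tss=20):
    """Return (all uATGs, uATGs at least `min_from_tss` nt from the TSS)."""
    positions = uatg_positions(utr5)
    far = sum(1 for p in positions if p >= min_from_tss)
    return len(positions), far


# Toy 5' UTR: one ATG right at the TSS, one 24 nt downstream.
utr = "ATGC" + "C" * 20 + "ATGCC"
total, far_from_tss = count_uatgs(utr)
print(uatg_positions(utr), total, far_from_tss)
```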

List of genes with low TE and uATGs far from TSS

Check these for ribosome occupancy at uATG.

## # A tibble: 0 x 8
## # Groups:   Gene [?]
## # ... with 8 variables: Gene <chr>, RNA <dbl>, RPF <dbl>, TE <dbl>,
## #   uATGCt <int>, uATGCtmin20 <int>, u1.cxtn <chr>, u2.cxtn <chr>
## # A tibble: 7 x 8
## # Groups:   Gene [?]
##   Gene       RNA   RPF     TE uATGCt uATGCtmin20 u1.cxtn  u2.cxtn 
##   <chr>    <dbl> <dbl>  <dbl>  <int>       <int> <chr>    <chr>   
## 1 CNA07610  50.0  5.85 0.117       1           1 TCCGTATG <NA>    
## 2 CNF00330 182.   3.20 0.0176      8           8 AAAAAATG CAAAAATG
## 3 CNG00290  58.3  4.38 0.0751      1           1 GCAGGATG <NA>    
## 4 CNG04240 123.   5.50 0.0446      0           0 <NA>     <NA>    
## 5 CNH02210  42.4  1.68 0.0396      1           1 CCACAATG <NA>    
## 6 CNL04930 203.  17.3  0.0853      2           2 CGACAATG ACTTTATG
## 7 CNM02470 171.  17.0  0.0993      2           2 CCAGAATG CCATCATG

uATG vs aATG ribosome occupancy depends on the context, narrow

For top 3315 / 50% of genes by mean RNA TPM.

uATG vs aATG ribosome occupancy depends on the context, wide all reps

For top 3315 / 50% of genes by mean RNA TPM, summarized by gene, both samples.

uATG vs aATG ribosome occupancy depends on the context, wide summarized

For top 3315 / 50% of genes by mean RNA TPM, with only a single uATG, summarized by gene, median across 4 samples.


H99 second

Expression: RNA abundance and ribosome-protected-fragments

Load expression data

## # A tibble: 6,790 x 4
## # Groups:   Gene [6,790]
##    Gene          RNA    RPF    TE
##    <chr>       <dbl>  <dbl> <dbl>
##  1 CNAG_06125 10270. 20140. 1.96 
##  2 CNAG_06101  8775.  8494. 0.968
##  3 CNAG_05762  7529.  7499. 0.996
##  4 CNAG_00779  3896.  7432. 1.91 
##  5 CNAG_03127  6254.  7164. 1.15 
##  6 CNAG_06222  6631.  6772. 1.02 
##  7 CNAG_04011 13306.  6772. 0.509
##  8 CNAG_01455 12670.  6548. 0.517
##  9 CNAG_05525  6970.  6461. 0.927
## 10 CNAG_03739  6515.  6383. 0.980
## # ... with 6,780 more rows

We also calculated hiTrans_H99, the top 5% (330) translated genes by RPF TPM.

Ribosome occupancy mostly tracks RNA abundance

ATG Context

Load context data

## # A tibble: 6,791 x 19
##    Gene  aATG.context aATG.pos d1.context d1.posTSS d1.posATG d1.frame
##    <chr> <chr>           <dbl> <chr>          <dbl>     <dbl>    <dbl>
##  1 CNAG… TACTTACGCGA…       70 AAATTCACT…       100        30        0
##  2 CNAG… GAACTTCGATC…       52 TCTCCCGCC…       114        62        2
##  3 CNAG… TGTCTCCTTGA…      104 ACTTACGCC…       189        85        1
##  4 CNAG… CACATACGTAA…      214 CCGAACGGC…       256        42        0
##  5 CNAG… GACTATACAAA…       55 GGAGGTGGG…       163       108        0
##  6 CNAG… AACCATACAAA…       99 CAAAGCCAT…       259       160        1
##  7 CNAG… ACCGTGCACAC…       75 GTATTCGGA…       105        30        0
##  8 CNAG… GTTTTCAACAG…       73 CCCATCAGA…       380       307        1
##  9 CNAG… GTACTATTGAA…      206 GAGGCTCCG…       513       307        1
## 10 CNAG… TACAAGCTTGA…       90 GGCCGCCTT…       151        61        1
## # ... with 6,781 more rows, and 12 more variables: d2.context <chr>,
## #   d2.posTSS <dbl>, d2.posATG <dbl>, d2.frame <dbl>, u1.context <chr>,
## #   u1.posTSS <dbl>, u1.posATG <dbl>, u1.frame <dbl>, u2.context <chr>,
## #   u2.posTSS <dbl>, u2.posATG <dbl>, u2.frame <dbl>

Annotated ATGs have a Kozak consensus sequence

Highly translated Annotated ATGs have a Kozak consensus sequence

That’s for hiTrans_H99, the top 5% (330) translated genes by RPF TPM.

Cytoplasmic Ribosome Annotated ATGs have a consensus sequence

Ideally would fix this more nicely.

All cytoribo genes are highly translated

Venn diagram

Upstream ATGs don’t have a consensus

First upstream ATG.

Downstream ATGs don’t have a consensus

First downstream ATG

Downstream ATGs in frame and highly translated don’t have a consensus

Except for 3rd-codon-position bias.

Calculate Information content and scores of consensus motif

Calculate a wide and a narrow consensus sequence

Calculate motif scores against the position weight matrix (PWM) for both the narrow (-4 from ATG through to ATG) and wide (-10 from ATG through to ATG) Kozak consensus motifs. These motifs are taken from the top 5% most highly translated genes.

Estimate the information content

The information content is estimated as for a sequence logo; details at https://en.wikipedia.org/wiki/Sequence_logo.

This is equal to the total height of the letters in the sequence logo, summed across the positions of the motif.

## # A tibble: 6 x 4
##   Genes    ATG   Width  Infon
##   <chr>    <chr> <chr>  <dbl>
## 1 All      aATG  narrow 0.944
## 2 HiTrans  aATG  narrow 2.96 
## 3 CytoRibo aATG  narrow 4.16 
## 4 All      d1ATG narrow 0.116
## 5 HiTrans  d1ATG narrow 0.233
## 6 CytoRibo d1ATG narrow 0.492

Information content of the highly translated consensus, in bits and excluding the 6 bits from the ATG itself: narrow 2.96, wide 3.88.

Estimate information content per base across start context

Calculate scores of aATG, dATG, uATG against Kozak consensus

Write scores to file scores_kozak_H99.txt.

## # A tibble: 6,791 x 13
##    Gene  aATG.scorekn d1.scorekn d2.scorekn u1.scorekn aATG.scorekw
##    <chr>        <dbl>      <dbl>      <dbl>      <dbl>        <dbl>
##  1 CNAG…        0.904      0.776      0.747      0.847        0.849
##  2 CNAG…        0.871      0.841      1         NA            0.817
##  3 CNAG…        0.786      0.849      0.720     NA            0.769
##  4 CNAG…        0.920      0.785      0.788      0.849        0.866
##  5 CNAG…        0.978      0.759      0.822     NA            0.929
##  6 CNAG…        0.978      0.704      0.763     NA            0.928
##  7 CNAG…        0.966      0.856      0.849     NA            0.873
##  8 CNAG…        0.878      0.834      0.978     NA            0.851
##  9 CNAG…        0.891      0.787      0.821      0.821        0.900
## 10 CNAG…        0.874      0.935      0.662      0.678        0.825
## # ... with 6,781 more rows, and 7 more variables: d1.scorekw <dbl>,
## #   d2.scorekw <dbl>, u1.scorekw <dbl>, d1vsan <dbl>, u1vsan <dbl>,
## #   d1vsaw <dbl>, u1vsaw <dbl>

Plot against narrow consensus (-4 to ATG)

Plot against wide consensus (-10 to ATG)

Wide consensus comparing d1ATG to d2ATG score

Mutual information of different positions around ATG

For all annotated genes

For highly translated genes

For cytoplasmic ribosomal proteins

This illustrates that the mutual information (MI) between pairs of nucleotides is generally weak, except for nucleotides sharing a codon in the +4 to +12 positions and, secondarily, at the -6 to -4 positions.

What is the -5 correlation in hiTrans genes?

What is the -5 correlation in cyto ribo genes?

So there is a strong tendency to have a C at -5 if there is a T at -6.

What about the -4 correlation in cyto ribo genes?

This is uninteresting because the counts are so low with no -4A. The -3A is even worse.

Compare aATG and dATG context by gene

Most dATG scores are less than aATG scores

Most u1ATG scores are less than aATG scores

For highly translated genes, most dATG scores are much less than aATG scores

Filtered for enough RNA!

Small negative correlation between dATG and aATG score

Rnarrow = -0.058; Rwide = -0.036

There may be some trend but it is weak

Boxplots show enough_RNA only.

Genes with unusual dATG vs aATG score

Genes with an unusually high d1ATG narrow score relative to their aATG narrow score are in this list:

## # A tibble: 330 x 3
##    Gene       aATG.scorekn d1.scorekn
##    <chr>             <dbl>      <dbl>
##  1 CNAG_04764        0.633      0.990
##  2 CNAG_04147        0.636      0.990
##  3 CNAG_00165        0.662      1    
##  4 CNAG_07473        0.633      0.968
##  5 CNAG_07801        0.662      0.978
##  6 CNAG_02259        0.662      0.978
##  7 CNAG_07776        0.688      1    
##  8 CNAG_06751        0.691      0.990
##  9 CNAG_04179        0.696      0.990
## 10 CNAG_01092        0.675      0.966
## # ... with 320 more rows

dATG in frame with ATG, narrow

Genes with high difference in narrow score, filtered for reasonable amounts of RNA, in frame. Saved to dvsaATG_highdiffn_inframe_H99.txt.

## # A tibble: 97 x 3
##    Gene       aATG.scorekn d1.scorekn
##    <chr>             <dbl>      <dbl>
##  1 CNAG_04764        0.633      0.990
##  2 CNAG_04147        0.636      0.990
##  3 CNAG_00165        0.662      1    
##  4 CNAG_07473        0.633      0.968
##  5 CNAG_07801        0.662      0.978
##  6 CNAG_02259        0.662      0.978
##  7 CNAG_07776        0.688      1    
##  8 CNAG_04179        0.696      0.990
##  9 CNAG_01092        0.675      0.966
## 10 CNAG_00409        0.636      0.920
## # ... with 87 more rows

dATG out of frame with ATG, narrow

Genes with high difference in narrow score, filtered for reasonable amounts of RNA, out of frame. Saved to dvsaATG_highdiffn_outframe_H99.txt.

## # A tibble: 52 x 3
##    Gene       aATG.scorekn d1.scorekn
##    <chr>             <dbl>      <dbl>
##  1 CNAG_06278        0.730      0.990
##  2 CNAG_04264        0.721      0.978
##  3 CNAG_04054        0.750      0.978
##  4 CNAG_03665        0.789      1    
##  5 CNAG_06627        0.770      0.978
##  6 CNAG_01999        0.733      0.938
##  7 CNAG_06006        0.747      0.938
##  8 CNAG_03811        0.790      0.978
##  9 CNAG_06344        0.752      0.935
## 10 CNAG_07979        0.818      1    
## # ... with 42 more rows

dATG in frame with ATG, wide

Genes with high difference in wide score, filtered for reasonable amounts of RNA, in frame. Saved to dvsaATG_highdiffw_inframe_H99.txt.

## # A tibble: 102 x 3
##    Gene       aATG.scorekw d1.scorekw
##    <chr>             <dbl>      <dbl>
##  1 CNAG_04147        0.611      0.929
##  2 CNAG_02259        0.641      0.953
##  3 CNAG_04764        0.616      0.905
##  4 CNAG_07801        0.682      0.945
##  5 CNAG_04179        0.691      0.947
##  6 CNAG_07776        0.712      0.967
##  7 CNAG_07473        0.639      0.893
##  8 CNAG_03953        0.687      0.940
##  9 CNAG_03486        0.700      0.951
## 10 CNAG_05722        0.652      0.901
## # ... with 92 more rows

dATG out of frame with ATG, wide

Genes with high difference in wide score, filtered for reasonable amounts of RNA, out of frame. Saved to dvsaATG_highdiffw_outframe_H99.txt.

## # A tibble: 44 x 3
##    Gene       aATG.scorekw d1.scorekw
##    <chr>             <dbl>      <dbl>
##  1 CNAG_06278        0.693      0.931
##  2 CNAG_03811        0.752      0.966
##  3 CNAG_03370        0.738      0.941
##  4 CNAG_06006        0.670      0.863
##  5 CNAG_06782        0.704      0.895
##  6 CNAG_01675        0.646      0.827
##  7 CNAG_06508        0.756      0.933
##  8 CNAG_07680        0.792      0.968
##  9 CNAG_00973        0.791      0.963
## 10 CNAG_01999        0.705      0.876
## # ... with 34 more rows

dATG vs aATG ribosome occupancy depends on the context

For top 3315 / 50% of genes by mean RNA TPM.

dATG vs aATG ribosome occupancy and narrow score, geometric mean across reps

dATG vs aATG ribosome occupancy and wide score, geometric mean across reps

Compare score difference to localization predictions

Load predictions from mitofates

In input file H99_mitofates.txt.

Genes with high dATG vs aATG score are enriched in mitochondrial presequences

There are spectacular correlations between d1 score, d1 frame, and localization

Count dATG vs aATG score, d1 frame, mito pre, enough RNA

## # A tibble: 16 x 5
## # Groups:   enoughR, d1vsaw0p1, d1.framefac [?]
##    enoughR d1vsaw0p1 d1.framefac Pred_preseq     n
##    <fct>   <fct>     <fct>       <fct>       <int>
##  1 Yes     d1lo      In          No            752
##  2 Yes     d1lo      In          Yes           120
##  3 Yes     d1lo      Out         No           2018
##  4 Yes     d1lo      Out         Yes           241
##  5 Yes     d1hi      In          No             69
##  6 Yes     d1hi      In          Yes            41
##  7 Yes     d1hi      Out         No             53
##  8 Yes     d1hi      Out         Yes             4
##  9 No      d1lo      In          No            906
## 10 No      d1lo      In          Yes            55
## 11 No      d1lo      Out         No           2222
## 12 No      d1lo      Out         Yes            81
## 13 No      d1hi      In          No             98
## 14 No      d1hi      In          Yes            10
## 15 No      d1hi      Out         No             91
## 16 No      d1hi      Out         Yes             7

However, mito-localized genes do not have a distinctive aATG context

It’s just a subset: the dual-localized ones.

Alternative Localization predictions from DeepLoc 1.0

Load predictions from DeepLoc

In input file H99_DeepLoc.txt.

Compare mitofates and DeepLoc mitochondrial predictions

Compare mitofates and DeepLoc plastid predictions

There are correlations between d1 score, d1 frame, and DeepLoc predictions

Signal peptides predicted by SignalP

Load predictions

In input file H99_SignalP.txt.

## # A tibble: 15 x 5
## # Groups:   enoughRNA, d1vsawfac, d1.framefac [?]
##    enoughRNA  d1vsawfac                d1.framefac SignalP     n
##    <fct>      <fct>                    <fct>       <chr>   <int>
##  1 enough RNA "AUG score\nd < a + 0.1" In          N         836
##  2 enough RNA "AUG score\nd < a + 0.1" In          Y          36
##  3 enough RNA "AUG score\nd < a + 0.1" Out         N        2148
##  4 enough RNA "AUG score\nd < a + 0.1" Out         Y         111
##  5 enough RNA "AUG score\na + 0.1 < d" In          N         107
##  6 enough RNA "AUG score\na + 0.1 < d" In          Y           3
##  7 enough RNA "AUG score\na + 0.1 < d" Out         N          57
##  8 not enough "AUG score\nd < a + 0.1" In          N         909
##  9 not enough "AUG score\nd < a + 0.1" In          Y          52
## 10 not enough "AUG score\nd < a + 0.1" Out         N        2163
## 11 not enough "AUG score\nd < a + 0.1" Out         Y         141
## 12 not enough "AUG score\na + 0.1 < d" In          N         105
## 13 not enough "AUG score\na + 0.1 < d" In          Y           3
## 14 not enough "AUG score\na + 0.1 < d" Out         N          96
## 15 not enough "AUG score\na + 0.1 < d" Out         Y           2

Compare SignalP and DeepLoc plastid predictions

There are correlations between d1 score, d1 frame, and SignalP predictions

uATGs inhibit translation of the main aORF

uATGs are associated with lower absolute translation

uATGs are associated with lower translation efficiency

uATGs associated with lower translation efficiency are over 20nt from TSS

uATG score weakly affects TE

We suspect that uATG is associated with lower TE if the uATG has

  • position at least 20nt downstream from TSS
  • higher score

This figure shows that, for genes with only 1 uATG, this correlation is weak.

List of genes with low TE and uATGs far from TSS

Many of these (CNAG_03140, CNAG_07695, CNAG_06246) are strongly translationally repressed and have good context at the uATG.

## # A tibble: 0 x 8
## # Groups:   Gene [?]
## # ... with 8 variables: Gene <chr>, RNA <dbl>, RPF <dbl>, TE <dbl>,
## #   uATGCt <int>, uATGCtmin20 <int>, u1.cxtn <chr>, u2.cxtn <chr>
## # A tibble: 5 x 8
## # Groups:   Gene [?]
##   Gene         RNA   RPF     TE uATGCt uATGCtmin20 u1.cxtn  u2.cxtn 
##   <chr>      <dbl> <dbl>  <dbl>  <int>       <int> <chr>    <chr>   
## 1 CNAG_00784  52.8  6.69 0.127       1           0 TCCGTATG <NA>    
## 2 CNAG_03578  43.5  6.20 0.143       1           1 GCAGGATG <NA>    
## 3 CNAG_05574  30.0  2.66 0.0885      1           1 CCACAATG <NA>    
## 4 CNAG_06246 196.  24.6  0.125       2           2 CCAGAATG CCATCATG
## 5 CNAG_07813 148.  20.1  0.136       1           1 CGGCAATG <NA>

uATG vs aATG ribosome occupancy depends on the context, narrow

For top 3315 / 50% of genes by mean RNA TPM.

uATG vs aATG ribosome occupancy depends on the context, wide all reps

For top 3315 / 50% of genes by mean RNA TPM, summarized by gene, all 4 samples.

uATG vs aATG ribosome occupancy depends on the context, wide summarized

For top 3315 / 50% of genes by mean RNA TPM, with only a single uATG, summarized by gene, median across 4 samples.


Results on conserved genes in H99 and JEC21.

Load list of orthologs

From 2016 Paper.

## # A tibble: 6,341 x 2
##    H99        JEC21   
##    <chr>      <chr>   
##  1 CNAG_01397 CND05080
##  2 CNAG_07825 CNH03545
##  3 CNAG_05539 CNH01890
##  4 CNAG_03635 CNB01365
##  5 CNAG_06621 CNF03970
##  6 CNAG_00830 CNA08090
##  7 CNAG_07556 CNK01100
##  8 CNAG_06796 CNB00060
##  9 CNAG_06009 CNM00180
## 10 CNAG_03522 CNG00710
## # ... with 6,331 more rows
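A minimal Python sketch of joining the two strains' expression tables through this gene-pair list. The pairs and TE values are taken from tables in this report. Keeping only pairs with data in both strains (an inner join) is an assumption of the sketch, and `join_pairs` is a hypothetical name.

```python
def join_pairs(pairs, te_h99, te_jec21):
    """Inner join: keep gene pairs with a TE value in both strains."""
    joined = []
    for h99_gene, jec21_gene in pairs:
        if h99_gene in te_h99 and jec21_gene in te_jec21:
            joined.append({
                "H99": h99_gene,
                "JEC21": jec21_gene,
                "TE.H99": te_h99[h99_gene],
                "TE.JEC21": te_jec21[jec21_gene],
            })
    return joined


pairs = [
    ("CNAG_06125", "CNM01300"),
    ("CNAG_06101", "CNM01080"),
    ("CNAG_01397", "CND05080"),   # no expression values supplied below
]
te_h99 = {"CNAG_06125": 1.96, "CNAG_06101": 0.968}
te_jec21 = {"CNM01300": 4.59, "CNM01080": 1.06}
joined = join_pairs(pairs, te_h99, te_jec21)
for row in joined:
    print(row)
```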

Conservation of gene expression

RPF vs RNA with same colours

RNA Abundance

Ribosome Occupancy

Translation efficiency, no threshold

Translation efficiency, filtered by top 50% of expression

Genes with high translation

## # A tibble: 20 x 8
##    H99        JEC21    RNA.H99 RPF.H99 TE.H99 RNA.JEC21 RPF.JEC21 TE.JEC21
##    <chr>      <chr>      <dbl>   <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
##  1 CNAG_06125 CNM01300  10270.  20140.  1.96      3981.    18260.    4.59 
##  2 CNAG_06101 CNM01080   8775.   8494.  0.968     8422.     8957.    1.06 
##  3 CNAG_00779 CNA07570   3896.   7432.  1.91      5811.     7134.    1.23 
##  4 CNAG_03127 CNG04360   6254.   7164.  1.15      3171.     7048.    2.22 
##  5 CNAG_05762 CNF02150   7529.   7499.  0.996    15611.     6144.    0.394
##  6 CNAG_03739 CNB02360   6515.   6383.  0.980     3764.     6958.    1.85 
##  7 CNAG_06222 CNM02240   6631.   6772.  1.02      4689.     6013.    1.28 
##  8 CNAG_00655 CNA06350  12483.   6041.  0.484    15019.     6591.    0.439
##  9 CNAG_04011 CNB04930  13306.   6772.  0.509    19847.     5650.    0.285
## 10 CNAG_06633 CNF03840   9267.   6136.  0.662    11321.     6175.    0.545
## 11 CNAG_01332 CND04480   5923.   6076.  1.03      4638.     5976.    1.29 
## 12 CNAG_03015 CNC00700   4856.   5691.  1.17      2344.     6224.    2.65 
## 13 CNAG_04448 CNI01090   6654.   5950.  0.894     5345.     5891.    1.10 
## 14 CNAG_00640 CNA06200   7701.   5784.  0.751     4883.     6037.    1.24 
## 15 CNAG_00771 CNA07490   6955.   5860.  0.843     7040.     5959.    0.846
## 16 CNAG_04883 CNJ03110   4270.   5872.  1.38      5747.     5908.    1.03 
## 17 CNAG_04726 CNJ01560   8038.   6353.  0.790     6386.     5396.    0.845
## 18 CNAG_00672 CNA06500   9191.   6054.  0.659    14270.     5645.    0.396
## 19 CNAG_05525 CNH01770   6970.   6461.  0.927     4051.     5204.    1.28 
## 20 CNAG_03780 CNB02750   6868.   5685.  0.828     5180.     5878.    1.13

Genes with high TE

## # A tibble: 20 x 8
##    H99        JEC21    RNA.H99 RPF.H99 TE.H99 RNA.JEC21 RPF.JEC21 TE.JEC21
##    <chr>      <chr>      <dbl>   <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
##  1 CNAG_01130 CND02530    47.7   301.    6.30      32.7      297.     9.09
##  2 CNAG_01890 CNK02310   248.   1445.    5.84     279.      1901.     6.82
##  3 CNAG_06150 CNM01520   607.   3347.    5.51     539.      2928.     5.43
##  4 CNAG_02994 CNC06020    68.5   261.    3.81      31.1      218.     7.01
##  5 CNAG_01750 CNC02520   256.   1307.    5.10     311.      1655.     5.33
##  6 CNAG_01727 CNC02320   736.   3570.    4.85     706.      3747.     5.31
##  7 CNAG_01744 CNC02470   104.    315.    3.04      33.5      237.     7.07
##  8 CNAG_04327 CNI02220    44.7   201.    4.51      35.2      197.     5.58
##  9 CNAG_01117 CND02420   436.   2095.    4.81     439.      2275.     5.18
## 10 CNAG_05907 CNF00650    94.5   298.    3.15      66.3      450.     6.79
## 11 CNAG_04640 CNJ00800   217.    822.    3.80     159.       969.     6.10
## 12 CNAG_04313 CNI02360   204.    225.    1.10      30.8      254.     8.23
## 13 CNAG_07373 CNA06000    65.7   302.    4.59      77.8      354.     4.55
## 14 CNAG_05602 CNH02450   381.    675.    1.77      27.4      192.     7.00
## 15 CNAG_06840 CND06220  1196.   2878.    2.41     460.      2895.     6.29
## 16 CNAG_00136 CNA01230    46.4   197.    4.24      45.3      199.     4.40
## 17 CNAG_05884 CNF00890    79.8   294.    3.69      72.8      360.     4.95
## 18 CNAG_06208 CNM02070   251.    976.    3.89     231.       986.     4.27
## 19 CNAG_00992 CND01200   254.    889.    3.50     262.      1168.     4.47
## 20 CNAG_04659 CNJ00950    25.8    55.6   2.16      26.3      152.     5.80

Genes with low TE

To-do: Check which of these have uATGs.

## # A tibble: 20 x 8
##    H99        JEC21    RNA.H99 RPF.H99 TE.H99 RNA.JEC21 RPF.JEC21 TE.JEC21
##    <chr>      <chr>      <dbl>   <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
##  1 CNAG_07695 CNF00330   164.    5.69  0.0346     182.      3.20   0.0176 
##  2 CNAG_03140 CNG04240   187.    2.00  0.0107     123.      5.50   0.0446 
##  3 CNAG_05574 CNH02210    30.0   2.66  0.0885      42.4     1.68   0.0396 
##  4 CNAG_04855 CNJ02770    30.4   2.67  0.0879      87.9     6.69   0.0761 
##  5 CNAG_06614 CNF04050    41.8   4.39  0.105       57.8     4.57   0.0791 
##  6 CNAG_02323 CNE02240    39.1   3.59  0.0918      49.9     5.12   0.103  
##  7 CNAG_03578 CNG00290    43.5   6.20  0.143       58.3     4.38   0.0751 
##  8 CNAG_07813 CNL04930   148.   20.1   0.136      203.     17.3    0.0853 
##  9 CNAG_06246 CNM02470   196.   24.6   0.125      171.     17.0    0.0993 
## 10 CNAG_05319 CNH03140    35.4   0.602 0.0170      35.4     7.69   0.218  
## 11 CNAG_00784 CNA07610    52.8   6.69  0.127       50.0     5.85   0.117  
## 12 CNAG_08027 CNH02090    25.6   2.75  0.107       89.8    13.6    0.152  
## 13 CNAG_02433 CNE01240    38.9   6.78  0.174      114.     11.1    0.0970 
## 14 CNAG_00529 CNA05110    40.2   7.86  0.196       96.7     7.83   0.0809 
## 15 CNAG_05237 CNL03915    33.1   9.02  0.273       72.1     0.277  0.00384
## 16 CNAG_05288 CNH03430    56.4   9.02  0.160       70.0     8.83   0.126  
## 17 CNAG_02867 CNC04820    54.5   7.14  0.131       47.6     7.51   0.158  
## 18 CNAG_01624 CNC01375    34.5   5.07  0.147       30.4     4.36   0.143  
## 19 CNAG_05567 CNH02150    36.1   6.85  0.189       75.9     7.98   0.105  
## 20 CNAG_06782 CNB00170    53.7   8.43  0.157       59.8     9.21   0.154

Genes with dAUG score high relative to aAUG score

We take transcripts where the overall gene expression (RNA abundance in top 50%), the difference in score (dATG > aATG in top 5%), and the dATG frame are all conserved between H99 and JEC21.
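
The three conservation criteria above can be sketched in base R as follows. This is an illustration on made-up data, not the code that produced the tables. The gene names, the `RNA.*` and `d.inframe.*` columns and the helper `top_frac` are hypothetical, and applying the top-5% cutoff to the score difference separately in each strain is an assumption.

```r
# Hypothetical ortholog table: context scores for the annotated (a) and
# downstream (d) ATG, RNA abundance, and whether the dATG is in frame.
orth <- data.frame(
  H99             = paste0("g", 1:6, "_H99"),
  JEC21           = paste0("g", 1:6, "_JEC21"),
  a.skn.H99       = c(0.66, 0.90, 0.70, 0.95, 0.80, 0.85),
  d.skn.H99       = c(1.00, 0.80, 0.75, 0.90, 0.85, 0.80),
  a.skn.JEC21     = c(0.67, 0.91, 0.72, 0.94, 0.81, 0.86),
  d.skn.JEC21     = c(1.00, 0.82, 0.74, 0.90, 0.84, 0.80),
  RNA.H99         = c(500, 100, 50, 900, 20, 300),
  RNA.JEC21       = c(450, 120, 40, 800, 30, 250),
  d.inframe.H99   = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  d.inframe.JEC21 = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# TRUE for values in the top `frac` fraction of x.
top_frac <- function(x, frac) x >= quantile(x, 1 - frac, names = FALSE)

# Score difference, dATG minus aATG, in each strain.
diff.H99   <- orth$d.skn.H99   - orth$a.skn.H99
diff.JEC21 <- orth$d.skn.JEC21 - orth$a.skn.JEC21

# All three criteria must hold in both strains.
keep <- top_frac(orth$RNA.H99, 0.5) & top_frac(orth$RNA.JEC21, 0.5) &
  top_frac(diff.H99, 0.05) & top_frac(diff.JEC21, 0.05) &
  orth$d.inframe.H99 == orth$d.inframe.JEC21

hits     <- orth[keep, ]
inframe  <- hits[hits$d.inframe.H99, ]   # dATG in frame with the aATG
outframe <- hits[!hits$d.inframe.H99, ]  # dATG out of frame
print(hits[, c("H99", "JEC21")])
```

In-frame hits would give an N-terminally truncated protein if the dATG is used, while out-of-frame hits would give an unrelated peptide, which is why the two lists are kept separate.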

dATG in frame with aATG

Saved to file dvsaATG_highdiffn_inframe_cc.txt.

## # A tibble: 47 x 6
##    H99        JEC21    a.skn.H99 d.skn.H99 a.skn.JEC21 d.skn.JEC21
##    <chr>      <chr>        <dbl>     <dbl>       <dbl>       <dbl>
##  1 CNAG_00165 CNA01530     0.662     1           0.673       1    
##  2 CNAG_07776 CNI00670     0.688     1           0.698       1    
##  3 CNAG_07473 CNB01880     0.633     0.968       0.655       0.888
##  4 CNAG_07801 CNL06190     0.662     0.978       0.673       0.899
##  5 CNAG_00086 CNA00760     0.725     1           0.733       1    
##  6 CNAG_04179 CNI03160     0.696     0.990       0.715       0.945
##  7 CNAG_07873 CNH00360     0.721     0.990       0.740       0.989
##  8 CNAG_05722 CNF02520     0.662     0.916       0.673       0.925
##  9 CNAG_06353 CNN00820     0.688     0.945       0.699       0.945
## 10 CNAG_03953 CNB04410     0.723     0.990       0.741       0.945
## 11 CNAG_02306 CNE02400     0.723     0.968       0.741       0.966
## 12 CNAG_02545 CNE00210     0.696     0.938       0.715       0.940
## 13 CNAG_02259 CNE02870     0.662     0.978       0.683       0.827
## 14 CNAG_01544 CNC06400     0.759     0.978       0.756       0.977
## 15 CNAG_03996 CNB04810     0.696     0.895       0.655       0.886
## 16 CNAG_04219 CNI03610     0.770     0.978       0.761       0.977
## 17 CNAG_02880 CNC04930     0.696     0.916       0.715       0.917
## 18 CNAG_03679 CNB01775     0.822     1           0.756       1    
## 19 CNAG_02431 CNE01260     0.759     1           0.763       0.940
## 20 CNAG_00517 CNA04990     0.728     0.938       0.736       0.940
## # ... with 27 more rows

dATG out of frame

Saved to file dvsaATG_highdiffn_outframe_cc.txt.

## # A tibble: 14 x 6
##    H99        JEC21    a.skn.H99 d.skn.H99 a.skn.JEC21 d.skn.JEC21
##    <chr>      <chr>        <dbl>     <dbl>       <dbl>       <dbl>
##  1 CNAG_06278 CNN00160     0.730     0.990       0.759       0.989
##  2 CNAG_04054 CNB05380     0.750     0.978       0.758       0.977
##  3 CNAG_02809 CNC04270     0.800     0.978       0.793       0.977
##  4 CNAG_03008 CNC06190     0.827     1           0.819       1    
##  5 CNAG_03370 CNG02120     0.795     0.968       0.786       0.966
##  6 CNAG_01667 CNC01780     0.835     0.990       0.828       0.989
##  7 CNAG_03190 CNG03790     0.812     0.966       0.809       0.964
##  8 CNAG_04896 CNJ03200     0.821     0.975       0.823       0.975
##  9 CNAG_02894 CNC05065     0.819     0.978       0.837       0.977
## 10 CNAG_02578 CNK00690     0.818     0.944       0.785       0.941
## 11 CNAG_01241 CND03590     0.747     0.878       0.749       0.895
## 12 CNAG_06899 CNK00290     0.836     0.975       0.845       0.975
## 13 CNAG_07780 CNI00090     0.871     1           0.861       1    
## 14 CNAG_03839 CNB03280     0.817     0.938       0.813       0.940
  • CNN00160 two-component-like sensor kinase TCO7
  • CNB05380 SUI1/eIF1, translation initiation factor.
  • CNC06190 has a domain conserved with eIF2.
  • CNG02120 frequenin calcium-binding protein, FRQ1 homolog
  • CNC05065/CNM00150/CNC04270/CNC06190/CNA07610/CND03900, all hypothetical or uncharacterized.
  • CNC01780 prenyltransferase, COQ2 homolog
  • CNK00690 regulation of meiosis-related, PCH2 homolog
  • CND03590 protein phosphatase I nuclear regulatory subunit, SDS22 homolog
  • CNI00090 farnesyltranstransferase, BTS1 homolog
  • CND03900 has a W2 eIF4-gamma/eIF5/eIF2-epsilon-like domain
  • CNB03280 mRNA transcription modulator, CCR4-NOT ubiquitin ligase subunit MOT2 homolog.

Genes with very different aAUG scores, JEC21 vs H99

Filtered for enough RNA (top 50%)

## # A tibble: 135 x 6
##    H99        JEC21    a.skn.H99 d.skn.H99 a.skn.JEC21 d.skn.JEC21
##    <chr>      <chr>        <dbl>     <dbl>       <dbl>       <dbl>
##  1 CNAG_04147 CNI02850     0.636     0.990       0.989       0.884
##  2 CNAG_01092 CND02180     0.675     0.966       0.964       0.917
##  3 CNAG_03486 CNG01060     0.691     0.966       0.964       0.813
##  4 CNAG_06196 CNM01950     0.721     0.662       0.989       0.733
##  5 CNAG_00690 CNA06680     0.704     0.895       0.964       0.683
##  6 CNAG_01188 CND03130     0.636     0.828       0.888       0.733
##  7 CNAG_06000 CNM00090     0.717     0.911       0.964       0.745
##  8 CNAG_07437 CND03020     0.720     0.966       0.964       0.888
##  9 CNAG_05678 CNF02960     0.675     0.906       0.914       0.929
## 10 CNAG_06446 CNN01710     0.762     1           1           0.658
## # ... with 125 more rows
## # A tibble: 135 x 6
##    H99        JEC21    a.skn.H99 d.skn.H99 a.skn.JEC21 d.skn.JEC21
##    <chr>      <chr>        <dbl>     <dbl>       <dbl>       <dbl>
##  1 CNAG_04488 CNI00690     0.945     0.968       0.673       0.945
##  2 CNAG_03968 CNB04570     0.990     0.757       0.730       0.989
##  3 CNAG_00529 CNA05110     0.953     0.803       0.698       0.892
##  4 CNAG_03410 CNG01740     0.990     0.730       0.742       0.989
##  5 CNAG_05720 CNF02540     1         0.746       0.756       1    
##  6 CNAG_04751 CNJ01820     0.966     0.854       0.725       0.964
##  7 CNAG_04899 CNJ03230     0.953     0.819       0.723       0.966
##  8 CNAG_04089 CNB05680     0.966     0.809       0.743       0.886
##  9 CNAG_05504 CNH01580     0.966     0.812       0.776       0.842
## 10 CNAG_05692 CNF02820     0.919     0.783       0.732       0.845
## # ... with 125 more rows
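
The row order in both tables above is consistent with sorting by the difference in aATG score between strains. A minimal base-R sketch of that ranking follows, using four rows copied from the tables. The column name `a.diff` is hypothetical, and the cutoff that yields 135 genes in each direction is not reproduced here.

```r
# Four ortholog pairs copied from the two tables above.
scores <- data.frame(
  H99         = c("CNAG_04147", "CNAG_06446", "CNAG_04488", "CNAG_05720"),
  JEC21       = c("CNI02850", "CNN01710", "CNI00690", "CNF02540"),
  a.skn.H99   = c(0.636, 0.762, 0.945, 1),
  a.skn.JEC21 = c(0.989, 1, 0.673, 0.756)
)

# Positive values mean the annotated ATG has the better context in JEC21.
scores$a.diff <- scores$a.skn.JEC21 - scores$a.skn.H99

# Largest JEC21 advantage first.
higher_JEC21 <- scores[scores$a.diff > 0, ]
higher_JEC21 <- higher_JEC21[order(-higher_JEC21$a.diff), ]

# Largest H99 advantage first.
higher_H99 <- scores[scores$a.diff < 0, ]
higher_H99 <- higher_H99[order(higher_H99$a.diff), ]

print(higher_JEC21)
print(higher_H99)
```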

Many of these have the expected structure, in which the homologs differ only at the N-terminus. There appears to be a swap between a near-ATG start codon and a poor-context ATG between the two strains.

Higher aAUG score in JEC21:

  • CNAG_04147/CNI02850, putative RNA helicase RRP3 homolog
  • CNAG_06000/CNM00090, putative glycoprotein
  • CNAG_03486/CNG01060, peptidylprolyl isomerase, Scer has CPR2/CPR5 homologs localized to ER and vacuole.
  • CNAG_06353/CNN00820, tRNA pseudouridine synthase, exciting, CNAG_06353 has weak mito loc seq, Scer has PUS1/PUS2 nuc/mito homologs.
  • CNAG_06446/CNN01710, mitochondrial splicing suppressor Mss51 homolog, clear mito preseq in both.
  • CNAG_01092/CND02180, hypothetical protein conserved in crypto, mito preseq in both
  • CNAG_01188/CND03130, ATGATG start in H99, not interesting
  • CNAG_05678/CNF02960, putative transmembrane protein involved in ammonia production.
  • CNAG_07645/CNE03115, autophagy protein Atg12, H99 CNAG_07645 has a non-homologous annotated N-terminal extension that seems improbable.

Higher aAUG score in H99:

  • CNAG_00529/CNA05110, sulfite transporter, Scer Ssu1 homolog.
  • CNAG_03410/CNG01740, upstream ATG in JEC21 close to TSS, prob not real
  • CNAG_04089/CNB05680, again upstream ATG in JEC21 close to TSS, prob not real. Note that CNAG_03410/CNAG_04089 are paralogs
  • CNAG_04751/CNJ01820, hypothetical protein conserved in tremellomycetes.
  • CNAG_02703/CNK01910, hypothetical protein conserved in tremellomycetes.
  • CNAG_03638/CNB01390, identical ATG context in fungidb, check.
  • CNAG_05504/CNH01580, Iah1, pattern unclear, JEC21 just has poor start codons.
  • CNAG_05692/CNF02820, sphinganine kinase, Scer YSR3/LCB2 homolog.
  • CNAG_07609/CNC03180, putative RNA helicase, weak start context in JEC21 predicts less protein in cell. Or most translation initiation is from much later ATG? DDX51/DBP6 homolog.

Most of these look either misannotated in one strain or not interesting. Is the upstream start codon in one strain actually used? Check for ribosome footprints and for other features (homology, mito localization seq). It would be nice to have an additional filter here.

Genes with different predicted mito localization, JEC21 vs H99

We checked whether genes with different predicted mito localization in the two strains could have swapped a poor-context ATG for a near-ATG start codon. However, we did not find good evidence for that. It might still happen if alternative TSSs are used.

Filtered for enough RNA (top 50%).

## # A tibble: 28 x 4
##    H99        JEC21    Prob_preseq.H99 Prob_preseq.JEC21
##    <chr>      <chr>              <dbl>             <dbl>
##  1 CNAG_07163 CNE02790           1                 0    
##  2 CNAG_04443 CNI01140           0.999             0    
##  3 CNAG_02354 CNE01960           0.997             0.064
##  4 CNAG_05511 CNH01640           0.995             0    
##  5 CNAG_04664 CNJ00980           0.984             0.056
##  6 CNAG_00427 CNA04120           0.979             0.063
##  7 CNAG_01522 CNC06600           0.966             0    
##  8 CNAG_07352 CNA03190           0.916             0.18 
##  9 CNAG_05058 CNL05610           0.859             0.113
## 10 CNAG_01145 CND02700           0.841             0.34 
## # ... with 18 more rows
## # A tibble: 29 x 4
##    H99        JEC21    Prob_preseq.H99 Prob_preseq.JEC21
##    <chr>      <chr>              <dbl>             <dbl>
##  1 CNAG_00304 CNA02885           0                 0.981
##  2 CNAG_02033 CNE05020           0.174             0.955
##  3 CNAG_03649 CNB01500           0.228             0.733
##  4 CNAG_03403 CNG01800           0.302             0.729
##  5 CNAG_02794 CNC04130           0.005             0.68 
##  6 CNAG_06328 CNN00560           0.349             0.661
##  7 CNAG_04539 CNI00150           0.371             0.655
##  8 CNAG_03540 CNG00570           0.292             0.65 
##  9 CNAG_03984 CNB04715           0.367             0.615
## 10 CNAG_04801 CNJ02280           0.316             0.601
## # ... with 19 more rows
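
A selection like the two tables above can be sketched in base R as below, using four rows copied from them. The cutoff of 0.5 is an assumption (it is consistent with the rows shown, where one strain is above 0.5 and the other below), and the original criterion may differ.

```r
# Four ortholog pairs copied from the tables above.
mito <- data.frame(
  H99               = c("CNAG_07163", "CNAG_01145", "CNAG_00304", "CNAG_04801"),
  JEC21             = c("CNE02790", "CND02700", "CNA02885", "CNJ02280"),
  Prob_preseq.H99   = c(1, 0.841, 0, 0.316),
  Prob_preseq.JEC21 = c(0, 0.34, 0.981, 0.601)
)

cutoff <- 0.5  # assumed threshold for calling a presequence

# Predicted presequence in H99 only, strongest H99 prediction first.
only_H99 <- mito[mito$Prob_preseq.H99 > cutoff &
                   mito$Prob_preseq.JEC21 < cutoff, ]
only_H99 <- only_H99[order(-only_H99$Prob_preseq.H99), ]

# Predicted presequence in JEC21 only, strongest JEC21 prediction first.
only_JEC21 <- mito[mito$Prob_preseq.JEC21 > cutoff &
                     mito$Prob_preseq.H99 < cutoff, ]
only_JEC21 <- only_JEC21[order(-only_JEC21$Prob_preseq.JEC21), ]

print(only_H99)
print(only_JEC21)
```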

There seem to be more with mito-gain in H99, or with alternative non-ATG starts in JEC21.

  • CNAG_07163, mito membrane insertase OXA1. JEC21 CNE02790 downstream non-ATG initiation?
  • CNAG_04443, hydrolase homolog. Duplicate orthologs in pombe and nidulans suggest dual-localization. H99 CNAG_04443 has ribosomes only from d1ATG, JEC21 CNI01140 has ribosomes on uORF.
  • CNAG_02354, mito RPL2? JEC21 CNE01960 annotation questions as several in-frame ATGs. old aATG (new ext) has mito loc, H99 homology.
  • CNAG_05511, ATPase, MSP1 (mitochondrial sorting protease) paralog. JEC21 CNH01640, again 5’ annotation questions.
  • CNAG_04664, mito mRPL54. JEC21 CNJ00980, again 5’ annotation questions, d1ATG looks like start codon.
  • CNAG_00427, mitochondrial protein phosphatase, PTC5 homolog. Difficult to tell 5’ end details, but H99 5’ end is longer. Many short insertions in H99 relative to JEC21. Also do get weak mito prediction from CNA04120 N-term ext.
  • CNAG_01522, mitochondrial phe-tRNA synthetase. JEC21 CNC06600, initiation downstream of original aATG, upstream of new aATG.
  • CNAG_07352, mito complex MICOS MIC26 homolog. JEC21 CNA03190, short 5’UTR, initiation downstream of annotated ATG, also alt 5’UTR intron. However since N-term seqs are near-identical, probably CNA03190 misclassified by mitofates.
  • CNAG_05058, mito ribosome-associated GTPase MTG2 homolog. Looks like CNL05610 is misclassified by mitofates.
  • CNAG_01145, australin/borealin related chromosome passenger protein

  • CNAG_00304, uncharacterized hypothetical protein. Misannotated. Translation starts at dATG, shorter ORF has strong mito. prediction.
  • CNAG_02033, uncharacterized protein with signal sequence and mito RPL51-like domain. maybe CNAG_02033 initiates downstream from aATG?
  • CNAG_03649, hypothetical protein with transmembrane domains. Suspect false negative mitofates.
  • CNAG_03403, mitochondrial gene expression activator, DPC29 homolog. Suspect initiation downstream of ATG in H99.
  • CNAG_02794, mito Dihydroorotate dehydrogenase, distant (cyto) URA1 homolog, pombe (mito) ura3 homolog. H99 ribosome fragments at upstream out-of-frame ATG.
  • CNAG_06328, hypothetical protein, some fungal homologs, alpha/beta barrels

GO analysis of conserved genelists - OUT OF DATE Oct 2018

This was done on 25th June, with values generated by CryptoATGcontext then. Not a reproducible analysis here!

I performed GO analysis with PANTHER.db on JEC21 gene names. PANTHER version 13.1 Released 2018-02-03, Overrepresentation test on GOslim terms.

Link: http://www.pantherdb.org/tools/compareToRefList.jsp
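
For readers unfamiliar with overrepresentation tests, the idea can be sketched in base R with a one-sided Fisher's exact test per term, followed by Benjamini-Hochberg correction. This is a stand-in for illustration only: the test and correction settings used in the original PANTHER run are not recorded here, and every count and term name below is hypothetical.

```r
# Hypothetical sizes: a gene list drawn from a reference genome.
n_ref  <- 6000  # genes in the reference list
n_list <- 50    # genes in the test list (a subset of the reference)

# Hypothetical per-term counts: genes annotated in the reference and list.
in_ref  <- c(termA = 120, termB = 100, termC = 300)
in_list <- c(termA = 10,  termB = 1,   termC = 2)

# One-sided Fisher's exact test for enrichment of a single term.
enrich_p <- function(term) {
  k <- in_list[[term]]
  K <- in_ref[[term]]
  tab <- matrix(c(k,     n_list - k,
                  K - k, (n_ref - n_list) - (K - k)),
                nrow = 2)
  fisher.test(tab, alternative = "greater")$p.value
}

p_raw <- sapply(names(in_ref), enrich_p)

# Correct for testing several terms at once.
p_adj <- p.adjust(p_raw, method = "BH")
print(round(p_adj, 4))
```

With these toy counts only termA, with 10 observed against roughly 1 expected, comes out as enriched.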

dAUG score high compared to aATG and dATG out of frame, 14 genes

File dvsaATG_highdiffn_outframe_cc.txt.

No significant GO terms.

dAUG score high compared to aATG and dATG in frame, 44 genes

File dvsaATG_highdiffn_inframe_cc.txt.

Enriched in Biological processes:

  • tRNA aminoacylation for protein translation < translation < protein metabolic process < primary metabolic process < metabolic process
  • tRNA metabolic process < RNA metabolic process < nucleobase-containing compound metabolic process
  • Unclassified

Molecular Function:

  • aminoacyl-tRNA ligase activity < ligase activity < catalytic activity

Cellular Component:

  • cytosol < cytoplasm < intracellular < cell part

Highly translated, 291 genes

File hiTrans_cc.txt.

Enriched BPs include:

  • oxidative phosphorylation
  • glycolysis
  • translation
  • protein folding
  • cation transport
  • mitochondrial transport

Enriched MFs include:

  • transmembrane transporter activity
  • structural constituent of ribosome
  • translation elongation factor activity

Enriched CCs include:

  • proton-transporting ATP synthase complex
  • ribosome
  • mitochondrial inner membrane

High translation efficiency, 174 genes

File hiTE_cc.txt.

Enriched BPs include:

  • pentose-phosphate shunt < monosaccharide metabolic process
  • tRNA aminoacylation for protein translation < translation
  • acyl-CoA metabolic process
  • tricarboxylic acid cycle
  • cellular amino acid biosynthetic process
  • nuclear transport

Enriched MFs include:

  • translation elongation factor activity
  • translation initiation factor activity
  • aminoacyl-tRNA ligase activity

Enriched CCs include:

  • cytosol < cytoplasm

Low translation efficiency, 284 genes

File loTE_cc.txt.

Enriched BPs include:

  • anion transport

Enriched MFs, no sig. results.

Enriched CCs, no sig. results.


Composite figures for paper

Figure 2 panels: TE and uAUG usage in H99

Figure S2 panels: TE and uAUG usage in JEC21

Figure 3 panels: score and AUG/uAUG usage in H99

Figure S3 panels: score and AUG/uAUG usage in JEC21

Figure 6 top: score and AUG/dAUG usage in H99